feat: Build cached anchor/alias node list for faster alias resolving#612
feat: Build cached anchor/alias node list for faster alias resolving#612eemeli merged 4 commits intoeemeli:mainfrom PolyMeilex:faster-alias-resolution
Conversation
eemeli
left a comment
There was a problem hiding this comment.
I can see how this might be helpful. I'd much prefer for most of the work to be done in the Alias resolve() method, though. See inline for further comments.
Would you be able to share some YAML file with which the performance benefits of the change become measurable?
Had to anonymize it, so it looks kinda goofy, but here is one of the worst offenders in my data set: https://pastebin.com/nPB4hLyT (usually my files are a lot smaller than this, but with each loaded file it adds up quickly) |
eemeli
left a comment
There was a problem hiding this comment.
Getting there, I think. And thank you for the sample data, I'll need to find a bit of time to test how much this improves the performance on it.
Not a proper scientific benchmark, but with the example file this is the difference on my setup:
Also just as a fun fact, caching all nodes without any filtering still gives a significant improvement as we skip the visitor's overhead: |
|
We could also build an anchors map where a key would be the anchor's name and the value would be the node with all its child aliases already resolved (or maybe already JSified). Then if a duplicated anchor name is found it would simply replace the previous one, but I'm assuming that there is some reason why that's not how it's done already, that I'm not aware of. |
There was a problem hiding this comment.
Thank you @PolyMeilex for your contribution!
With you test file, this does provide something like a 12x improvement when I test it locally, though processing it does require adjusting maxAliasCount to avoid detecting the input as a resource exhaustion attack.
We could also build an anchors map where a key would be the anchor's name [...] but I'm assuming that there is some reason why that's not how it's done already, that I'm not aware of.
That's effectively ctx.anchors, but it's not as simple as one might like. The YAML spec allows for fun structures like this:
&foo
key:
- *foo
- &foo 42
- *fooHere, the resolution of the top-level map depends on resolving all of its contents, and that includes resolving the first *foo as a circular reference to the map itself, as well as resolving the second *foo as the scalar 42, thanks to the anchor re-use.
Getting that to work right was a bit tricky.
Oh, yeah that's fun, would rather pay the price of wasteful iteration then try to implement this 😅 Thank you for being open to this! I will try to find more low hanging fruit so hopefully we can keep generators, while still making the package usable for my usecase 🚀 |
Currently, to resolve an alias, we have to visit the whole document tree until the alias node is found. The lower in the file we put an alias, the slower the resolution is.
This is an attempt to speed this process significantly by:
Not sure if this is the right approach, but it helped me reduce the load time of my dataset, which is not very alias-heavy, from 10s to 4s.